Thomas Lumley
6-7 July 2016
Analytic modelling of survey data is important, over and above just tabulating summaries.
People will do it better if they can use most of the same approaches they use for cohort or panel data.
Tools for exploratory analysis are still a gap in many texts and most software. 
Supply replicate weights instead of design meta-data: use svrepdesign()
Or, create them from a design object: use as.svrepdesign() (JK1, JKn, BRR, bootstraps)
Various PPS linearisation estimators also available
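A minimal sketch of both routes, using the api example data shipped with the survey package (the variable names here come from that example, not from any survey in these slides):

```r
## Sketch: replicate weights from scratch vs converted from a design object
library(survey)
data(api)

## design meta-data version: stratified sample of California schools
dstrat <- svydesign(id=~1, strata=~stype, weights=~pw,
                    data=apistrat, fpc=~fpc)

## replicate-weight version: stratified jackknife (JKn)
rstrat <- as.svrepdesign(dstrat, type="JKn")

## same point estimate, standard error now computed from the replicates
svymean(~enroll, rstrat)
```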
barplot(svyby(~enroll, ~stype, design=dstrat, svymean),
    xlab="Mean school size", col="goldenrod")

library(rmeta)
means <- svyby(~enroll, ~stype, design=dstrat, svymean)
metaplot(coef(means), SE(means), labels=names(coef(means)),
    nn=1, boxsize=0.2,
    xlab="Mean school size", ylab="School type")

Usually choose the default bandwidth as if an SRS of the same size.
cdf.est<-svycdf(~enroll+api00, dstrat)
cdf.pop<-ecdf(apipop$enroll)
cdf.samp<-ecdf(apistrat$enroll)
plot(cdf.pop,main="Population vs sample", xlab="Enrollment")
lines(cdf.samp,col="red",lwd=2)
lines(cdf.est[["enroll"]], col.points="forestgreen", lwd=2)

svyboxplot(enroll~stype, design=dclus2, col="orange", all.outliers=TRUE)

plot(svysmooth(~api00, dclus2))

svyhist(~api00, dclus2, col="orange", main="")

Complete code on github.com/tslumley/regression-paper
nhanes<- read.csv("combined-data.csv")
des <- svydesign(id=~SDMVPSU, strata=~SDMVSTRA, weights=~fouryearwt,
    nest=TRUE, data=subset(nhanes, !is.na(WTDRD1)))
des<-update(des, sodium=DR1TSODI/1000, potassium=DR1TPOTA/1000)
des<-update(des, namol=sodium/23, kmol=potassium/39)
des
## Stratified 1 - level Cluster Sampling design (with replacement)
## With (60) clusters.
## update(des, namol = sodium/23, kmol = potassium/39)

dim(des)
## [1] 18383 100
plot(BPXDAR~RIDAGEYR, data=nhanes, xlab="Age", ylab="Diastolic", pch=19)

Hard because:
Approaches:
Use partially-transparent points:
svyplot(BPXDAR~RIDAGEYR, design=des, xlab="Age", ylab="Diastolic",
    style="trans", pch=19)

svyplot(BPXSAR~RIDAGEYR, design=des, xlab="Age", ylab="Systolic",
    style="trans", pch=19,
    basecol=function(df){ifelse(df$RIAGENDR==1,"blue","pink")})

(Dan Carr, 1987)
svyplot(BPXDAR~RIDAGEYR, design=des, xlab="Age", ylab="Diastolic",
    style="hex", legend=0)

We only need the curve, not a standard error estimate, so this is easy.
l10<-svysmooth(BPXSAR~RIDAGEYR,design=des, method="quantreg",df=5,quantile=.1)
l25<-svysmooth(BPXSAR~RIDAGEYR,design=des, method="quantreg",df=5,quantile=.25)
l50<-svysmooth(BPXSAR~RIDAGEYR,design=des, method="quantreg",df=5,quantile=.5)
l75<-svysmooth(BPXSAR~RIDAGEYR,design=des, method="quantreg",df=5,quantile=.75)
l90<-svysmooth(BPXSAR~RIDAGEYR,design=des, method="quantreg",df=5,quantile=.9)
plot(BPXSAR~RIDAGEYR,data=nhanes,type="n",xlab="Age",ylab="Systolic")
lines(l10,lty=3)
lines(l25,col="grey",lwd=2)
lines(l50,lwd=2)
lines(l75,col="grey",lwd=2)
lines(l90, lty=3)

Show relationships in more than two dimensions by plotting Y~X conditioned on a range of Z.
svycoplot(BPXSAR~BPXDAR | cut(RIDAGEYR,c(0,21,40,60,100))*factor(RIAGENDR,labels=c("M","F")),
    design=des, xlab="Diastolic", ylab="Systolic", style="hex", xbins=30)

svycoplot(BPXSAR~BPXDAR | cut(RIDAGEYR,c(0,21,40,60,100)),
    design=des, xlab="Diastolic", ylab="Systolic", style="trans",
    basecol=function(df){ifelse(df$RIAGENDR==1,"blue","pink")})
library(hextri)
dd<-model.frame(des)[,c("RIDAGEYR","BPXSAR","fouryearwt","RIAGENDR")]
dd<-subset(na.omit(dd), BPXSAR>50 & BPXSAR<200)
hextri(BPXSAR~RIDAGEYR, weights=fouryearwt, class=RIAGENDR, nbins=40,
    style="size", col=c("orange","purple"), data=dd)

Instead of solving the Normal equations \[X^T(Y-\mu)=0\] we solve weighted Normal equations \[X^TW(Y-\mu)=0\] where \(W=\mathrm{diag}(\pi_i^{-1})\).
Under no assumptions about \(Y|X\), \(\hat\beta\) is consistent for the population least-squares line.
Variances from delta-method (‘sandwich’) or resampling.
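A sketch of what this means in practice, again with the survey package's api example data: solving \(X^TWX\hat\beta = X^TWY\) by hand reproduces the svyglm() point estimates (the standard errors differ, since svyglm uses the design-based sandwich variance).

```r
## Sketch: weighted Normal equations solved directly vs svyglm()
library(survey)
data(api)

dstrat <- svydesign(id=~1, strata=~stype, weights=~pw,
                    data=apistrat, fpc=~fpc)
fit <- svyglm(api00 ~ ell + meals, design=dstrat)

## solve X^T W X beta = X^T W Y with W = diag(sampling weights)
X <- model.matrix(~ ell + meals, apistrat)
w <- weights(dstrat)
beta <- solve(crossprod(X, w * X), crossprod(X, w * apistrat$api00))

## point estimates agree with svyglm()
all.equal(as.vector(beta), as.vector(coef(fit)), check.attributes=FALSE)
```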
For inference about population associations:
If \(E[Y|X=x]=x\beta\), so the model is correctly specified, and the weights depend on \(Y\) only through \(X\), then unweighted estimation is also consistent (and typically more efficient).
Bias/variance tradeoff: the larger the survey, the more we care about bias, so the more we want to include the weights
If \(E[Y|X=x]=x\beta\), but weights do not just depend on \(x\), can replace weights by \(w_i = g(x_i)/\pi_i\) for any function \(g\).
Optimal \(g()\) minimises the coefficient of variation of \(w_i\), not far from \(g(x)=E[\pi_i^{-1}|X=x]\)
(Pfefferman & Sverchkov, 1999; Brumback, Hernán, and Robins, 2000: “stabilised weights”)
Useful when \(\pi_i\) depends strongly on \(x\), weakly on \(Y\).
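A hypothetical sketch of the idea, with the api example data standing in for a real application (the working regression used to estimate \(g()\) is an illustrative choice, not part of the references above):

```r
## Hypothetical sketch of stabilised weights: estimate
## g(x) ~ E[pi^{-1} | X=x] by a working regression, then use pw/g(x)
library(survey)
data(api)

dstrat <- svydesign(id=~1, strata=~stype, weights=~pw,
                    data=apistrat, fpc=~fpc)

## working model for the inverse sampling probabilities given x
g <- fitted(lm(pw ~ ell + meals, data=apistrat))

## stabilised design: weights pw/g(x) vary less given x
dstab <- svydesign(id=~1, strata=~stype,
                   weights=apistrat$pw / g, data=apistrat)

## valid if E[Y|X=x] = x beta; typically smaller standard errors
svyglm(api00 ~ ell + meals, design=dstab)
```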
Same principle for generalised linear models: weighted likelihood equation \[D^TV^{-1}W(Y-\mu)=0\]
Similar principle for Cox model (Binder, 1992), loglinear models (Rao & Scott, 1981), proportional odds model, parametric survival models, etc, etc.
Main exception is mixed models: these model pairs of observations, so can’t just reweight with \(\pi_i\).
svyglm() for linear and generalised linear models
svycoxph() for the Cox proportional hazards model
svyolr() for proportional odds and other cumulative link models
svyloglin() for loglinear models of contingency tables
svymle() or withReplicates() for adding your own

All take a model formula and a survey design object, plus other options.
Data on blood pressure and diet from the US NHANES health survey.
Complex four-stage survey, but public-use data approximates by two-stage design.
nhanesdes <- svydesign(id=~SDMVPSU, strata=~SDMVSTRA,
    weights=~fouryearwt, nest=TRUE,
    data=subset(nhanes, !is.na(WTDRD1)))
nhanesdes <- update(nhanesdes, sodium=DR1TSODI/1000,
    potassium=DR1TPOTA/1000)
nhanesdes <- update(nhanesdes, namol = sodium/23,
    kmol = potassium/39)
nhanesdes <- update(nhanesdes, htn = (BPXSAR>140) | (BPXDAR>90))
coef(summary(model <- svyglm(BPXSAR~RIAGENDR+RIDAGEYR+factor(RIDRETH1)
    +BMXBMI+sodium+potassium,
    design=nhanesdes)))
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 97.0461695 1.38920535 69.8573247 2.353878e-26
## RIAGENDR -3.3704785 0.38506925 -8.7529152 1.890904e-08
## RIDAGEYR 0.4650764 0.01144886 40.6220626 1.904598e-21
## factor(RIDRETH1)2 0.2377258 1.35465032 0.1754886 8.623768e-01
## factor(RIDRETH1)3 -0.5100238 0.62819669 -0.8118855 4.259650e-01
## factor(RIDRETH1)4 3.0297383 0.64395692 4.7048773 1.206576e-04
## factor(RIDRETH1)5 1.2946580 0.88675393 1.4599969 1.590902e-01
## BMXBMI 0.3710158 0.03806145 9.7478105 3.024477e-09
## sodium 0.4287925 0.16189591 2.6485690 1.502436e-02
## potassium -0.8499141 0.17133145 -4.9606429 6.577844e-05
Is the relationship nonlinear?
termplot(model, terms=5, partial=TRUE, smooth=panel.smooth)

library(splines)
model2<-svyglm(BPXSAR~RIAGENDR*ns(RIDAGEYR,4)+factor(RIDRETH1)
+BMXBMI+sodium+potassium,
design=nhanesdes)
coef(summary(model2))[c("sodium","potassium"),]
## Estimate Std. Error t value Pr(>|t|)
## sodium 0.3081594 0.1567208 1.966295 0.0694138309
## potassium -0.7229498 0.1636270 -4.418278 0.0005839182
No real change. Weak association may be due to measurement error.
AIC(model, model2)
## eff.p AIC deltabar
## [1,] 4859.923 3230615 539.9914
## [2,] 7586.726 3103323 474.1704
regTermTest(model2, ~sodium+potassium)
## Wald test for sodium potassium
## in svyglm(formula = BPXSAR ~ RIAGENDR * ns(RIDAGEYR, 4) + factor(RIDRETH1) +
## BMXBMI + sodium + potassium, design = nhanesdes)
## F = 9.770038 on 2 and 14 df: p= 0.0022077
regTermTest(model2, ~factor(RIDRETH1), method="Wald")
## Wald test for factor(RIDRETH1)
## in svyglm(formula = BPXSAR ~ RIAGENDR * ns(RIDAGEYR, 4) + factor(RIDRETH1) +
## BMXBMI + sodium + potassium, design = nhanesdes)
## F = 14.83965 on 4 and 14 df: p= 0.000061444
regTermTest(model2, ~factor(RIDRETH1), method="LRT")
## Working (Rao-Scott+F) LRT for factor(RIDRETH1)
## in svyglm(formula = BPXSAR ~ RIAGENDR * ns(RIDAGEYR, 4) + factor(RIDRETH1) +
## BMXBMI + sodium + potassium, design = nhanesdes)
## Working 2logLR = 41.46407 p= 0.00062401
## (scale factors: 1.5 1.3 0.75 0.46 ); denominator df= 14
Basic idea: Rao & Scott (1981,1984) work out the sampling distribution of the pseudolikelihood ratio and the Pearson \(X^2\) score statistic in contingency tables
Lumley & Scott (2015) extend this to generalised linear models with arbitrary covariates.
Quadratic forms have a known (messy, but computable) distribution
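The survey package exposes this computation as pchisqsum(): the tail probability of a weighted sum of chi-squared variables, evaluated by Satterthwaite approximation, numerical integration, or a saddlepoint approximation. The weights here are illustrative, echoing the scale factors printed by the working LRT above.

```r
## Tail probability of a weighted sum of four chi-squared(1) variables
library(survey)
pchisqsum(10, df=c(1,1,1,1), a=c(1.5, 1.3, 0.75, 0.46),
          lower.tail=FALSE, method="saddlepoint")
```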
Design-based AIC: rather than \(-2\ell+2p\), use \(-2\hat\ell+2\hat\Delta\), where \(\hat\Delta\) is the trace of a design-effect matrix.
Under iid sampling, correct model, eigenvalues are 1 or 0 and \(\Delta=p\).
Under complex sampling, we can estimate it: if \(A^{-1}BA^{-1}\) is sandwich estimator, \(\hat\Delta\) is trace of \(A^{-1}B\)
For BIC we can’t just plug in the pseudolikelihood: BIC approximates posterior model probabilities, so we are not allowed to cheat.
Comes out as a penalised Wald statistic, with penalty \(p\log n^*\) for an effective sample size \(n^*\).
(Lumley & Scott, J. Survey Stat Methodology, 2016)
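The survey package implements this as a BIC() method for svyglm fits; it needs the maximal model under consideration as the maximal= argument. A sketch with the api example data (the models here are illustrative, not the NHANES fits above):

```r
## Sketch of the design-based BIC for nested svyglm models
library(survey)
data(api)

dclus2 <- svydesign(id=~dnum+snum, weights=~pw, data=apiclus2)
m1 <- svyglm(api00 ~ ell, design=dclus2)
m2 <- svyglm(api00 ~ ell + meals + mobility, design=dclus2)

## penalised Wald statistic with penalty p*log(n*), n* = effective n
BIC(m1, m2, maximal=m2)
```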
R has some maps, can handle shapefiles
brfss<-update(brfss, agegp=cut(AGE, c(0,35,50,65,Inf)))
hlth<-svyby(~I(HLTHPLAN==1), ~agegp+X_STATE, svymean,
design=brfss)
hlthdata<-reshape(hlth[,c(1,2,4)],idvar="X_STATE",
direction="wide",timevar="agegp")
names(hlthdata)[2:5]<-paste("age",1:4,sep="")
states@data <- merge(states@data, hlthdata,
    by.x="ST_FIPS", by.y="X_STATE", all=FALSE)
spplot(states,c("age1","age2","age3","age4"),
names.attr=c("<35","35-50","50-65","65+"))